Notebook Description

Often we want to check whether two files of seizure annotations contain the same annotations. For example, if you look through a week of recordings and annotate the seizures, comparing a classifier’s predictions with your annotations lets you count the false positives and (more importantly) the false negatives.

In [31]:
import os

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

As an example, we will look at the difference between a raw predictions output and the same file after it has been manually checked and the false positives removed using the gui.

In [42]:
checked_preds = pd.read_csv('./example_data/checked_predictions.csv', index_col=0)
raw_preds     = pd.read_csv('./example_data/raw_predictions.csv', index_col=0)

checked_preds = pd.read_csv('/media/jonathan/My Passport-ELE -EEG/2019/analysis pyecog/01b_baseline_annotation_only.csv',
                             index_col=0)
raw_preds     = pd.read_csv('/media/jonathan/My Passport-ELE -EEG/2019/analysis pyecog/prediction from Tawfeeq Library/predictions_baselineconverted01.csv',
                            index_col=0)
#
path= r'nathan\My Passport-ELE -EEG\2019/analysis pyecog/prediction from Ta'
raw_preds = pd.read_csv(path, index_col=0)

In [ ]:
# note to jonny 2019 :

The file output by the clf does not have the [] around the transmitter. Whereas
In [43]:
raw_preds.head()
Out[43]:
   old_index  filename                                       start   end     duration  transmitter  real_start           real_end
0  0          M1566408624_2019-08-21-18-30-24_tids_[142].h5  125.0   285.0   160.0     [142]        2019-08-21 18:32:29  2019-08-21 18:35:09
1  1          M1566459024_2019-08-22-08-30-24_tids_[119].h5  1965.0  2025.0  60.0      [119]        2019-08-22 09:03:09  2019-08-22 09:04:09
2  2          M1566459024_2019-08-22-08-30-24_tids_[119].h5  2125.0  2195.0  70.0      [119]        2019-08-22 09:05:49  2019-08-22 09:06:59
3  3          M1566459024_2019-08-22-08-30-24_tids_[119].h5  2290.0  2305.0  15.0      [119]        2019-08-22 09:08:34  2019-08-22 09:08:49
4  4          M1566459024_2019-08-22-08-30-24_tids_[119].h5  2380.0  2425.0  45.0      [119]        2019-08-22 09:10:04  2019-08-22 09:10:49
In [44]:
raw_preds.shape
Out[44]:
(112, 8)
In [45]:
checked_preds.shape
Out[45]:
(77, 8)
In [46]:
checked_preds.head()
Out[46]:
    old_index  filename                                           start    end      duration  transmitter  real_start        real_end
2   414        M1567629024_2019-09-04-21-30-24_tids_[28, 33, ...  860.18   939.98   79.80     [119]        04/09/2019 21:44  04/09/2019 21:46
4   407        M1567611024_2019-09-04-16-30-24_tids_[28, 33, ...  2976.58  3031.56  54.98     [119]        04/09/2019 17:20  04/09/2019 17:20
7   380        M1567524624_2019-09-03-16-30-24_tids_[28, 33, ...  26.26    68.79    42.53     [119]        03/09/2019 16:30  03/09/2019 16:31
9   377        M1567521024_2019-09-03-15-30-24_tids_[28, 33, ...  3298.90  3347.56  48.66     [119]        03/09/2019 16:25  03/09/2019 16:26
10  376        M1567521024_2019-09-03-15-30-24_tids_[28, 33, ...  2996.26  3057.73  61.47     [119]        03/09/2019 16:20  03/09/2019 16:21
In [47]:
print('So we expect there to be', raw_preds.shape[0]-checked_preds.shape[0], 'false positives')
So we expect there to be 35 false positives

Code for comparing the dataframes

In [48]:
def add_mcode_tid_col(df):
    '''Note this expects file start to be of format: M1513966209'''
    df['mcode_tid'] = df.filename.str.slice(0,11)+'_'+df.transmitter.astype(str)
    return df

def check_overlap(series1,series2):
    ''' pandas series should both have start and end columns
    http://baodad.blogspot.co.uk/2014/06/date-range-overlap.html
    '''
    start_a, end_a = float(series1.start), float(series1.end)
    start_b, end_b = float(series2.start), float(series2.end)
    overlap_bool = (start_a <= end_b) and (end_a>=start_b)
    return overlap_bool

def calculate_overlap(series1,series2):
    ''' pandas series should both have start and end attrs
    http://baodad.blogspot.co.uk/2014/06/date-range-overlap.html
    '''
    a, b = float(series1.start), float(series1.end)
    c, d = float(series2.start), float(series2.end)
    overlap = min([b-a,b-c,d-c,d-a])
    return overlap

def compare_dfs(prediction_df, annotation_df):
    '''
    Function to check how much of prediction_df is found in annotation_df

    Returns two dataframes:
        - preds_in_annotations_df: the predictions found within the annotations and the amount of overlap
          (probably corresponding to true positives). Has an overlap column (seconds).
        - preds_not_in_annotations_df: the predictions not found in the annotations
          (probably corresponding to false positives)

    Note:
        To check for false negatives, or missed seizures, pass in the actual annotations
        as the 'prediction_df' and the actual predictions as the 'annotation_df'.
        Here we would hope for no 'false positives', so the second dataframe will contain the
        missed seizures...
    '''

    # first add a 'mcode_tid' col: allows us to check for same hour and transmitter
    prediction_df = add_mcode_tid_col(prediction_df)
    annotation_df = add_mcode_tid_col(annotation_df)

    # Collect matched and unmatched prediction rows in lists; the dataframes are
    # built once at the end (DataFrame.append was removed in pandas 2.0)
    matched_rows, unmatched_rows = [], []

    # loop over the predictions
    for _, prediction_row_series in prediction_df.iterrows():
        overlap_found = False # does the predicted seizure overlap any annotation at all?
        t_overlap = 0         # total overlap time (seconds) between the prediction and annotations

        # find all annotations with the same hour and tid as the prediction row
        # this will often be zero or one row, but will be more if >1 seizures in a single hour
        relevant_annotations_df = annotation_df[annotation_df.mcode_tid == prediction_row_series.mcode_tid]

        # check whether the start and end columns overlap
        for _, annotation_row_series in relevant_annotations_df.iterrows():
            if check_overlap(prediction_row_series, annotation_row_series):
                overlap_found = True
                # if the prediction spans two annotated seizures, the overlaps are summed
                t_overlap += calculate_overlap(prediction_row_series, annotation_row_series)

        if overlap_found:
            prediction_row_series['overlap'] = t_overlap
            matched_rows.append(prediction_row_series)
        else:
            unmatched_rows.append(prediction_row_series)

    preds_in_annotations_df     = pd.DataFrame(matched_rows, columns=list(prediction_df.columns)+['overlap'])
    preds_not_in_annotations_df = pd.DataFrame(unmatched_rows, columns=prediction_df.columns)

    return preds_in_annotations_df, preds_not_in_annotations_df

true_positives, false_positives = compare_dfs(raw_preds,checked_preds)
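The matching key built by add_mcode_tid_col can be illustrated on a single row without pandas. This is only a sketch: make_key is a hypothetical helper, not part of the notebook, and it assumes (as the docstring does) that every filename starts with 'M' followed by a 10-digit timestamp.

```python
def make_key(filename, transmitter):
    """Hypothetical single-row stand-in for add_mcode_tid_col."""
    # First 11 characters: 'M' plus the 10-digit timestamp of the hourly file
    return filename[:11] + '_' + str(transmitter)

fname = 'M1566459024_2019-08-22-08-30-24_tids_[119].h5'

key_with_brackets    = make_key(fname, '[119]')
key_without_brackets = make_key(fname, 119)

print(key_with_brackets)     # M1566459024_[119]
print(key_without_brackets)  # M1566459024_119
```

Because the transmitter is converted to a string as-is, a file that stores the transmitter as 119 and one that stores it as [119] produce different keys, and their rows will never be matched to each other.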
In [49]:
true_positives.shape, false_positives.shape
Out[49]:
((67, 10), (45, 9))
In [50]:
true_positives.head()
Out[50]:
   old_index  filename                                       start   end     duration  transmitter  real_start           real_end             mcode_tid          overlap
1  1.0        M1566459024_2019-08-22-08-30-24_tids_[119].h5  1965.0  2025.0  60.0      [119]        2019-08-22 09:03:09  2019-08-22 09:04:09  M1566459024_[119]  58.43
2  2.0        M1566459024_2019-08-22-08-30-24_tids_[119].h5  2125.0  2195.0  70.0      [119]        2019-08-22 09:05:49  2019-08-22 09:06:59  M1566459024_[119]  39.32
3  3.0        M1566459024_2019-08-22-08-30-24_tids_[119].h5  2290.0  2305.0  15.0      [119]        2019-08-22 09:08:34  2019-08-22 09:08:49  M1566459024_[119]  15.00
4  4.0        M1566459024_2019-08-22-08-30-24_tids_[119].h5  2380.0  2425.0  45.0      [119]        2019-08-22 09:10:04  2019-08-22 09:10:49  M1566459024_[119]  43.08
5  5.0        M1566459024_2019-08-22-08-30-24_tids_[119].h5  2585.0  2635.0  50.0      [119]        2019-08-22 09:13:29  2019-08-22 09:14:19  M1566459024_[119]  50.00
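The overlap column is in seconds. The arithmetic behind check_overlap and calculate_overlap can be tried on plain numbers. The sketch below restates both formulas as standalone functions (intervals_overlap and overlap_seconds are hypothetical names), and the intervals are made-up values, not taken from the data.

```python
def intervals_overlap(start_a, end_a, start_b, end_b):
    """Same test as check_overlap, on plain numbers."""
    return (start_a <= end_b) and (end_a >= start_b)

def overlap_seconds(a, b, c, d):
    """Same formula as calculate_overlap for intervals [a, b] and [c, d]."""
    return min(b - a, b - c, d - c, d - a)

# Hypothetical prediction (2125-2195 s) partly covering an annotation (2150-2210 s)
partial = overlap_seconds(2125, 2195, 2150, 2210)

# Hypothetical annotation (2295-2300 s) lying entirely inside a prediction (2290-2305 s)
nested = overlap_seconds(2290, 2305, 2295, 2300)

# Disjoint intervals: the formula goes negative, so test for overlap first
disjoint_ok  = intervals_overlap(100, 200, 300, 400)
disjoint_raw = overlap_seconds(100, 200, 300, 400)

print(partial, nested, disjoint_ok, disjoint_raw)  # 45 5 False -100
```

The negative value in the disjoint case is why compare_dfs only calls calculate_overlap after check_overlap has returned True.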
In [51]:
false_positives.head()
Out[51]:
    old_index  filename                                       start   end     duration  transmitter  real_start           real_end             mcode_tid
0   0.0        M1566408624_2019-08-21-18-30-24_tids_[142].h5  125.0   285.0   160.0     [142]        2019-08-21 18:32:29  2019-08-21 18:35:09  M1566408624_[142]
21  21.0       M1566765024_2019-08-25-21-30-24_tids_[119].h5  875.0   950.0   75.0      [119]        2019-08-25 21:44:59  2019-08-25 21:46:14  M1566765024_[119]
23  23.0       M1566822624_2019-08-26-13-30-24_tids_[119].h5  2545.0  2625.0  80.0      [119]        2019-08-26 14:12:49  2019-08-26 14:14:09  M1566822624_[119]
24  24.0       M1566822624_2019-08-26-13-30-24_tids_[119].h5  2715.0  2760.0  45.0      [119]        2019-08-26 14:15:39  2019-08-26 14:16:24  M1566822624_[119]
25  25.0       M1566822624_2019-08-26-13-30-24_tids_[119].h5  2870.0  2910.0  40.0      [119]        2019-08-26 14:18:14  2019-08-26 14:18:54  M1566822624_[119]

Here we save the false positives so that they can be checked through using the gui

  • they might not all be false positives!
In [11]:
savename = 'predictions_not_in_annotations.csv'
false_positives.to_csv(savename,header=False, index=False)

Here flip the order of dataframes:

Note:
        To check for false negatives, or missed seizures, pass in the actual annotations
        as the 'prediction_df' and the actual predictions as the 'annotation_df'.
        Here we would hope for no 'false positives', so the second dataframe will contain the
        missed seizures...

Pass in the checked predictions as the predictions. This is similar to the case where you are checking for missed seizures.
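The effect of swapping the arguments can be seen on toy data. The sketch below is a simplified pure-Python stand-in for compare_dfs, not the notebook's implementation: split_by_match is a hypothetical helper, and the keys and times are made up.

```python
def split_by_match(queries, references):
    """Split queries into those overlapping any reference with the same key, and the rest.

    Each item is a (key, start, end) tuple, where key plays the role of mcode_tid.
    """
    matched, unmatched = [], []
    for key, start, end in queries:
        hit = any(k == key and start <= e and end >= s for k, s, e in references)
        (matched if hit else unmatched).append((key, start, end))
    return matched, unmatched

# Made-up data: one seizure is found, one is missed, and one prediction is spurious
annotations = [('M1_[119]', 100, 160), ('M1_[119]', 500, 540)]
predictions = [('M1_[119]', 110, 150), ('M2_[119]', 10, 40)]

# Predictions as queries: the unmatched rows are candidate false positives
_, candidate_false_positives = split_by_match(predictions, annotations)

# Annotations as queries: the unmatched rows are missed seizures (false negatives)
_, missed_seizures = split_by_match(annotations, predictions)

print(candidate_false_positives)  # [('M2_[119]', 10, 40)]
print(missed_seizures)            # [('M1_[119]', 500, 540)]
```

The same function answers both questions; only the roles of the two inputs change.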

In [52]:
true_positives, false_positives = compare_dfs(checked_preds,raw_preds)
In [53]:
true_positives.shape
Out[53]:
(69, 10)
In [56]:
false_positives.shape
savename = 'annotations_not_in_predictions.csv'
false_positives.to_csv(savename,header=True, index=False)

We would expect no 'false positives' here. Any rows that are returned are annotations that the predictions missed (provided the predictions cover the same time period as the annotations); the run below returns 8 such rows.

In [57]:
false_positives.shape
Out[57]:
(8, 9)
In [ ]: